feat: nms op #4246
Conversation
Commits: cubecl impl, block-parallel, bitmask-parallel two phase, optimize for cpu, formatting, simplify, response, remove kernel, refactor tensor op for ergonomics, linting, tests
Codecov Report
❌ Your project check has failed because the head coverage (68.88%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## main #4246 +/- ##
==========================================
+ Coverage 68.85% 68.88% +0.03%
==========================================
Files 1405 1407 +2
Lines 167686 167918 +232
==========================================
+ Hits 115456 115669 +213
- Misses 52230 52249 +19
☔ View full report in Codecov by Sentry.
For context: I'm not saying it is, just saying it may still be useful to have the GPU-only NMS, as I have used that myself. I believe having both versions available is ideal here, so users can choose which one fits their use case best.
The data was on the device being tested at the start, no transfers included. I didn't test the time to transfer the NMS input to CPU and back, though. I can include the kernel for completeness. I only tested a specific scenario on Apple Silicon, so maybe it's worth it in other cases. I'm sure the kernel has room for optimization as well.
One additional factor to consider here is that on Apple Silicon the data transfers are much faster because of unified memory. On Linux/CUDA the transfer itself might have a much bigger overhead because it's a separate device. In that case, even if the NMS algorithm itself is potentially slower on GPU, it may still be beneficial to run the slower algorithm on device and avoid the data transfer, rather than running a faster algorithm on CPU but paying for the transfer.
Great job! Thanks for the accelerated NMS implementation.
Just some minor comments below.
For a GPU implementation, I think the benefits typically come with a batched NMS implementation, which involves some tricks. Most commonly, the box coordinates are offset per class so NMS is applied to each class independently (ref: torchvision batched_nms).
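To make the offset trick concrete, here is a minimal CPU-side sketch of it on plain slices, assuming a single-class NMS callback over [x1, y1, x2, y2] boxes; all names are illustrative and not taken from this PR:
// Illustrative sketch of the class-offset trick used by batched NMS
// (e.g. torchvision's batched_nms); not code from this PR.
fn batched_nms(
    boxes: &[[f32; 4]],
    scores: &[f32],
    class_ids: &[u32],
    iou_threshold: f32,
    nms_single_class: impl Fn(&[[f32; 4]], &[f32], f32) -> Vec<usize>,
) -> Vec<usize> {
    // Offset each box by class_id * (max_coord + 1) so boxes of different
    // classes can never overlap; one NMS pass then acts per class.
    let max_coord = boxes
        .iter()
        .flat_map(|b| b.iter().copied())
        .fold(0.0f32, f32::max);
    let offset_boxes: Vec<[f32; 4]> = boxes
        .iter()
        .zip(class_ids)
        .map(|(b, &c)| {
            let off = c as f32 * (max_coord + 1.0);
            [b[0] + off, b[1] + off, b[2] + off, b[3] + off]
        })
        .collect();
    nms_single_class(&offset_boxes, scores, iou_threshold)
}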
// Filter by score threshold and convert to SoA format in one pass
let score_thresh = options.score_threshold;
let mut x1s = Vec::with_capacity(n_boxes);
let mut y1s = Vec::with_capacity(n_boxes);
let mut x2s = Vec::with_capacity(n_boxes);
let mut y2s = Vec::with_capacity(n_boxes);
let mut areas = Vec::with_capacity(n_boxes);
let mut filtered_scores = Vec::with_capacity(n_boxes);
let mut original_indices = Vec::with_capacity(n_boxes);
Instead of allocating 5 separate Vec<f32>s, it would likely be better to use a single contiguous "scratchpad" for the filtered box data, reducing this to a single heap allocation:
// After filtering and sorting, we know the number of filtered boxes `n_filtered`
let mut scratchpad = vec![0.0f32; n_filtered * 5];
// Split into slices
let (x1s, rest) = scratchpad.split_at_mut(n_filtered);
let (y1s, rest) = rest.split_at_mut(n_filtered);
let (x2s, rest) = rest.split_at_mut(n_filtered);
let (y2s, areas) = rest.split_at_mut(n_filtered);
for (i, b) in filtered.iter().enumerate() {
    // Fill the slices with the filtered box coordinates and areas
}

And the previous filtering can simply be applied to the indices instead:
let mut filtered_indices = Vec::with_capacity(n_boxes);
for i in 0..n_boxes {
    let score = scores_vec[i];
    if score >= options.score_threshold {
        filtered_indices.push(i); // original index
    }
}
let n_filtered = filtered_indices.len();
if n_filtered == 0 {
    return vec![];
}
// Note: I used total_cmp instead (which should handle NaNs instead of panicking)
filtered_indices.sort_by(|&a, &b| scores_vec[b].total_cmp(&scores_vec[a]));

This also makes it easier to implement aligned loads in the future if desired. Even without that, the single contiguous allocation should already be an improvement I believe.
You could benchmark both implementations to see if it's actually worth it.
It made for a nice improvement!
Unaligned:
| Benchmark | Burn Version | Shapes | Feature | Backend | Device | Median |
|---|---|---|---|---|---|---|
| nms-n100-f32 | local | [(100, 4)(100)] | ndarray | ndarray | Cpu | 17.000µs |
| nms-n1000-f32 | local | [(1000, 4)(1000)] | ndarray | ndarray | Cpu | 438.000µs |
| nms-n10000-f32 | local | [(10000, 4)(10000)] | ndarray | ndarray | Cpu | 21.385ms |
| nms-n5000-f32 | local | [(5000, 4)(5000)] | ndarray | ndarray | Cpu | 6.785ms |
Aligned:
| Benchmark | Burn Version | Shapes | Feature | Backend | Device | Median |
|---|---|---|---|---|---|---|
| nms-n100-f32 | local | [(100, 4)(100)] | ndarray | ndarray | Cpu | 16.000µs |
| nms-n1000-f32 | local | [(1000, 4)(1000)] | ndarray | ndarray | Cpu | 373.000µs |
| nms-n10000-f32 | local | [(10000, 4)(10000)] | ndarray | ndarray | Cpu | 17.337ms |
| nms-n5000-f32 | local | [(5000, 4)(5000)] | ndarray | ndarray | Cpu | 5.460ms |
With all_suppressed optimization:
| Benchmark | Burn Version | Shapes | Feature | Backend | Device | Median |
|---|---|---|---|---|---|---|
| nms-n100-f32 | local | [(100, 4)(100)] | ndarray | ndarray | Cpu | 15.000µs |
| nms-n1000-f32 | local | [(1000, 4)(1000)] | ndarray | ndarray | Cpu | 276.000µs |
| nms-n10000-f32 | local | [(10000, 4)(10000)] | ndarray | ndarray | Cpu | 11.933ms |
| nms-n5000-f32 | local | [(5000, 4)(5000)] | ndarray | ndarray | Cpu | 3.692ms |
Thanks for reporting the benchmarks! That's awesome.
Assuming the "unaligned" results are from the previous version, I think the biggest jump in "aligned" likely comes from the single vec allocation (for reads). But the aligned vec is a nice addition on top.
/// Non-Maximum Suppression options.
#[derive(Clone, Copy, Debug)]
pub struct NmsOptions {
    /// IoU threshold for suppression (default: 0.5).
    /// Boxes with IoU > threshold with a higher-scoring box are suppressed.
    pub iou_threshold: f32,
    /// Score threshold to filter boxes before NMS (default: 0.0, i.e., no filtering).
    /// Boxes with score < score_threshold are discarded.
    pub score_threshold: f32,
    /// Maximum number of boxes to keep (0 = unlimited).
    pub max_output_boxes: usize,
}
Future notes: might be useful to extend box format to [cx, cy, w, h] on top of the current [x1, y1, x2, y2] format.
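If that extension lands, the conversion is cheap enough to do up front; a minimal sketch with illustrative names, not code from this PR:
// Illustrative conversion from center format [cx, cy, w, h] to the corner
// format [x1, y1, x2, y2] used by the current implementation; not PR code.
fn cxcywh_to_xyxy(b: [f32; 4]) -> [f32; 4] {
    let [cx, cy, w, h] = b;
    [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]
}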
Thanks! After squeezing out some more performance, I'm happy with the state of the PR, but I think it's probably worthwhile to circle back to the kernel in the future, possibly when adding batch_nms as you say.
laggui
left a comment
Thanks for addressing my comments!
LGTM, pending one small change I believe is still required.
Approving in advance nonetheless.
match lanes {
    4 => *(suppressed.as_ptr().add(i) as *const u32) == 0x01010101,
    8 => *(suppressed.as_ptr().add(i) as *const u64) == 0x0101010101010101,
    16 => {
        *(suppressed.as_ptr().add(i) as *const u128)
            == 0x01010101010101010101010101010101
    }
    _ => unreachable!(),
}
This casts a *const bool (1-byte aligned) to a *const u32/u64/u128 without guaranteeing proper alignment. In practice I think it will work on modern platforms that typically over-align (as I've discovered in the past), but it's still technically UB.
I think an easy fix would be to simply read unaligned, e.g.
suppressed.as_ptr().add(i).cast::<u32>().read_unaligned()
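For reference, a hedged sketch of what the fixed check could look like with unaligned reads, assuming `suppressed: &[bool]` and `i + lanes <= suppressed.len()` as in the snippet above:
// Same comparisons as the original, but with unaligned reads so the
// bool-to-integer pointer casts no longer violate alignment requirements.
unsafe {
    match lanes {
        4 => suppressed.as_ptr().add(i).cast::<u32>().read_unaligned() == 0x01010101,
        8 => suppressed.as_ptr().add(i).cast::<u64>().read_unaligned() == 0x0101010101010101,
        16 => {
            suppressed.as_ptr().add(i).cast::<u128>().read_unaligned()
                == 0x01010101010101010101010101010101
        }
        _ => unreachable!(),
    }
}
A fully safe alternative would be suppressed[i..i + lanes].iter().all(|&s| s), which the optimizer may reduce to a similar wide comparison; worth a quick benchmark if you want to drop the unsafe block entirely.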
Pull Request Template
Checklist
- The cargo run-checks command has been executed.
Changes
Adds an nms op to burn-vision for detection use-cases. It matches the options of the NonMaxSuppression ONNX op to make it easy to support in later PRs.
Testing
I compared outputs from torchvision and this implementation and confirm they match with the same settings.
Note: I also wrote a CubeCL kernel for GPU acceleration, but try as I might, it was just slower than the CPU SIMD implementation. The data size is not that large for most applications and part of the algorithm is inherently sequential. The best I could get was ~70% slower than CPU for 800 boxes, 16% slower for 3200, and about the same for 12800. I decided to omit it from the PR because it's just faster to do it on CPU and transfer back. Maybe there are use-cases or scenarios I'm not considering.
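For reference, the core greedy loop looks like the following minimal scalar sketch (illustrative only, not the SIMD implementation in this PR); it shows why part of the algorithm is inherently sequential: whether a box survives depends on every higher-scoring box kept before it.
// Minimal scalar NMS sketch (illustrative, not the PR's SIMD code).
// `boxes` are [x1, y1, x2, y2] rows; `order` holds indices sorted by
// descending score; returns indices of kept boxes.
fn nms_reference(boxes: &[[f32; 4]], order: &[usize], iou_threshold: f32) -> Vec<usize> {
    let area = |b: &[f32; 4]| (b[2] - b[0]).max(0.0) * (b[3] - b[1]).max(0.0);
    let mut suppressed = vec![false; boxes.len()];
    let mut keep = Vec::new();
    for (pos, &i) in order.iter().enumerate() {
        if suppressed[i] {
            continue;
        }
        keep.push(i);
        // The inner loop is data-parallel (this is what the SIMD path speeds
        // up), but the outer loop must run in order.
        for &j in &order[pos + 1..] {
            if suppressed[j] {
                continue;
            }
            let (a, b) = (&boxes[i], &boxes[j]);
            let ix1 = a[0].max(b[0]);
            let iy1 = a[1].max(b[1]);
            let ix2 = a[2].min(b[2]);
            let iy2 = a[3].min(b[3]);
            let inter = (ix2 - ix1).max(0.0) * (iy2 - iy1).max(0.0);
            let union = area(a) + area(b) - inter;
            if union > 0.0 && inter / union > iou_threshold {
                suppressed[j] = true;
            }
        }
    }
    keep
}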